World Health Organization

Context

Many past studies have examined the factors affecting life expectancy using demographic variables, income composition, and mortality rates. However, the effects of immunization and the human development index were not taken into account. In addition, some of that research applied multiple linear regression to a data set covering only a single year for all countries. This motivates addressing both gaps by formulating a regression model, based on a mixed effects model and multiple linear regression, using data from 2000 to 2015 for all countries. Important immunizations such as Hepatitis B, Polio, and Diphtheria are also considered. In a nutshell, this study focuses on immunization factors, mortality factors, economic factors, social factors, and other health-related factors. Since the observations in this data set are organized by country, it is easier for a country to identify the predictors that contribute to a lower life expectancy. This helps suggest which areas a country should prioritize in order to efficiently improve the life expectancy of its population.

1. Do the predicting factors that were initially chosen really affect life expectancy? Which predicting variables actually affect life expectancy?

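After running three variable selection procedures (forward, backward, and stepwise), the following variables were retained as predictors of life expectancy.

Variables
- Adult.Mortality
- infant.deaths
- under.five.deaths
- Total.expenditure
- HIV.AIDS
- Income.composition.of.resources

While other variables may intuitively appear important, the variables above are the ones retained by the selection procedures. Of these, Adult.Mortality, Total.expenditure, HIV.AIDS, and Income.composition.of.resources are statistically significant at the 0.05 level; infant.deaths and under.five.deaths are retained but are not individually significant (p = 0.15 and p = 0.14). Interestingly, the first and third outputs below show the same six-predictor model. The second output is the full model with all 19 predictors, in which the same four variables are the only significant ones.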

## 
## Call:
## lm(formula = Life.expectancy ~ Adult.Mortality + infant.deaths + 
##     under.five.deaths + Total.expenditure + HIV.AIDS + Income.composition.of.resources, 
##     data = df1_complete)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -10.568  -1.569  -0.127   1.561  10.060 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     47.860575   2.066580  23.159  < 2e-16 ***
## Adult.Mortality                 -0.017405   0.003866  -4.502 1.53e-05 ***
## infant.deaths                    0.042287   0.029527   1.432 0.154625    
## under.five.deaths               -0.033561   0.022614  -1.484 0.140331    
## Total.expenditure                0.349095   0.111930   3.119 0.002258 ** 
## HIV.AIDS                        -0.809261   0.231621  -3.494 0.000661 ***
## Income.composition.of.resources 35.911609   2.502726  14.349  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.096 on 124 degrees of freedom
## Multiple R-squared:  0.8765, Adjusted R-squared:  0.8705 
## F-statistic: 146.7 on 6 and 124 DF,  p-value: < 2.2e-16
## 
## Call:
## lm(formula = Life.expectancy ~ Adult.Mortality + infant.deaths + 
##     Alcohol + percentage.expenditure + Hepatitis.B + Measles + 
##     BMI + under.five.deaths + Polio + Total.expenditure + Diphtheria + 
##     HIV.AIDS + GDP + Population + thinness..1.19.years + thinness.5.9.years + 
##     Income.composition.of.resources + Schooling + Status_dc, 
##     data = df1_complete)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.4098  -1.7264  -0.0392   1.7715   8.3880 
## 
## Coefficients:
##                                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      5.122e+01  3.314e+00  15.457  < 2e-16 ***
## Adult.Mortality                 -1.724e-02  4.148e-03  -4.157 6.36e-05 ***
## infant.deaths                    8.287e-02  5.619e-02   1.475 0.143057    
## Alcohol                          5.674e-03  9.749e-02   0.058 0.953689    
## percentage.expenditure           4.627e-04  4.639e-04   0.997 0.320716    
## Hepatitis.B                      1.205e-02  2.808e-02   0.429 0.668582    
## Measles                         -3.361e-05  4.823e-05  -0.697 0.487345    
## BMI                             -7.576e-03  2.000e-02  -0.379 0.705531    
## under.five.deaths               -6.014e-02  3.838e-02  -1.567 0.119989    
## Polio                           -8.746e-03  2.117e-02  -0.413 0.680327    
## Total.expenditure                2.878e-01  1.274e-01   2.259 0.025833 *  
## Diphtheria                       7.644e-03  3.445e-02   0.222 0.824805    
## HIV.AIDS                        -8.363e-01  2.470e-01  -3.385 0.000984 ***
## GDP                             -5.980e-05  6.656e-05  -0.898 0.370911    
## Population                      -1.729e-09  6.804e-09  -0.254 0.799816    
## thinness..1.19.years            -1.300e-01  2.267e-01  -0.574 0.567462    
## thinness.5.9.years               5.458e-03  2.227e-01   0.025 0.980489    
## Income.composition.of.resources  3.597e+01  6.228e+00   5.775 7.11e-08 ***
## Schooling                       -1.617e-01  2.740e-01  -0.590 0.556279    
## Status_dc                       -1.170e+00  1.035e+00  -1.130 0.261006    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.186 on 111 degrees of freedom
## Multiple R-squared:  0.8829, Adjusted R-squared:  0.8629 
## F-statistic: 44.06 on 19 and 111 DF,  p-value: < 2.2e-16
## 
## Call:
## lm(formula = Life.expectancy ~ Adult.Mortality + infant.deaths + 
##     under.five.deaths + Total.expenditure + HIV.AIDS + Income.composition.of.resources, 
##     data = df1_complete)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -10.568  -1.569  -0.127   1.561  10.060 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     47.860575   2.066580  23.159  < 2e-16 ***
## Adult.Mortality                 -0.017405   0.003866  -4.502 1.53e-05 ***
## infant.deaths                    0.042287   0.029527   1.432 0.154625    
## under.five.deaths               -0.033561   0.022614  -1.484 0.140331    
## Total.expenditure                0.349095   0.111930   3.119 0.002258 ** 
## HIV.AIDS                        -0.809261   0.231621  -3.494 0.000661 ***
## Income.composition.of.resources 35.911609   2.502726  14.349  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.096 on 124 degrees of freedom
## Multiple R-squared:  0.8765, Adjusted R-squared:  0.8705 
## F-statistic: 146.7 on 6 and 124 DF,  p-value: < 2.2e-16
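
For reference, the three selection procedures can be sketched with base R's `step()` function. The report's `df1_complete` data frame is not reproduced here, so the built-in `mtcars` data set stands in for it, and the object names (`null_model`, `full_model`, and so on) are assumptions rather than the names used in the original analysis. `step()` selects by AIC, which may differ from the criterion behind the outputs above.

```r
# Stand-in data: mtcars replaces df1_complete, which is not available here.
null_model <- lm(mpg ~ 1, data = mtcars)   # intercept only
full_model <- lm(mpg ~ ., data = mtcars)   # all candidate predictors

search_scope <- list(lower = formula(null_model), upper = formula(full_model))

# Forward selection: start empty, add one term at a time.
forward_fit <- step(null_model, scope = search_scope,
                    direction = "forward", trace = 0)

# Backward elimination: start full, drop one term at a time.
backward_fit <- step(full_model, direction = "backward", trace = 0)

# Stepwise: start empty, allow additions and removals at each step.
stepwise_fit <- step(null_model, scope = search_scope,
                     direction = "both", trace = 0)

# Compare which predictors each procedure retained.
selected <- lapply(
  list(forward = forward_fit, backward = backward_fit, stepwise = stepwise_fit),
  function(fit) sort(attr(terms(fit), "term.labels"))
)
print(selected)
```

Comparing the three entries of `selected` side by side is a quick way to check whether the procedures really agree on the retained variables.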

2. Should a country with a low life expectancy (<65) increase its healthcare expenditure in order to improve its average lifespan?

3. How do infant and adult mortality rates affect life expectancy?

4. Does life expectancy have a positive or negative correlation with eating habits, lifestyle, exercise, smoking, drinking alcohol, etc.?

5. What is the impact of schooling on the lifespan of humans?

Life expectancy and schooling have a positive linear relationship. Although schooling is not significant in the full model (p = 0.56), it is a significant predictor when modeled in a simple linear regression.

The model summary shows that the schooling coefficient is highly significant (p < 0.001), and schooling alone explains about 64% of the variation in life expectancy. The relationship can be modeled by the regression below:

                      life expectancy = 38.72 + 2.51(schooling)

Each additional year of schooling is associated with an increase of about 2.5 years in life expectancy, and the model predicts a life expectancy of 38.72 years with no schooling.

## 
## Call:
## lm(formula = Life.expectancy ~ Schooling, data = df1_complete)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.6180  -3.2318   0.5415   3.3678   9.3898 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  38.7198     2.1350   18.14   <2e-16 ***
## Schooling     2.5086     0.1646   15.24   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.162 on 129 degrees of freedom
## Multiple R-squared:  0.6429, Adjusted R-squared:  0.6401 
## F-statistic: 232.2 on 1 and 129 DF,  p-value: < 2.2e-16
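
To make the fitted line concrete, the sketch below plugs a few schooling values into the coefficients reported in the summary above. The helper name `predict_life_expectancy` and the chosen schooling values are illustrative assumptions, not part of the original analysis.

```r
# Coefficients copied from the lm() summary above.
intercept <- 38.7198
slope     <- 2.5086

# Hypothetical helper: fitted life expectancy for a given number of schooling years.
predict_life_expectancy <- function(schooling_years) {
  intercept + slope * schooling_years
}

schooling_years <- c(0, 6, 12)   # illustrative values, not taken from the data
fitted_values   <- predict_life_expectancy(schooling_years)
print(round(fitted_values, 2))
```

Values far outside the observed range of schooling, including zero, are extrapolations and should be read with caution.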

6. Does life expectancy have a positive or negative relationship with drinking alcohol?

In a simple linear regression, life expectancy has a positive relationship with alcohol consumption.

                      life expectancy = 67.11 + 1.11(Alcohol)

Each one-unit increase in the Alcohol variable is associated with a 1.11-year increase in life expectancy, starting from 67.11 years expected when no alcohol is consumed. The relationship is statistically significant (p < 0.001), but alcohol alone explains only about 28% of the variation in life expectancy.

This is counterintuitive, given that alcohol is not considered healthy and many studies suggest that it could shorten life spans. The simple regression does not control for any other factor, and in the full model the Alcohol coefficient is not significant (p = 0.95). Additional study may be needed.

## 
## Call:
## lm(formula = Life.expectancy ~ Alcohol, data = df1_complete)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -24.6896  -4.4306   0.8788   5.4897  15.8788 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  67.1100     0.8008   83.81  < 2e-16 ***
## Alcohol       1.1140     0.1571    7.09 7.89e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.328 on 129 degrees of freedom
## Multiple R-squared:  0.2804, Adjusted R-squared:  0.2748 
## F-statistic: 50.26 on 1 and 129 DF,  p-value: 7.887e-11
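
One possible explanation for a positive alcohol slope, not tested in this report, is confounding: a third factor such as income could raise both alcohol consumption and life expectancy. The simulation below uses made-up data (not the WHO data) in which alcohol has no direct effect at all, yet the simple regression still shows a positive slope. All variable names and parameter values are assumptions chosen for illustration.

```r
set.seed(42)
n <- 500

# Simulated (not WHO) data: income drives both alcohol use and life expectancy,
# while alcohol has no direct effect on life expectancy.
income  <- rnorm(n)
alcohol <- 5 + 2 * income + rnorm(n)
life    <- 70 + 5 * income + rnorm(n, sd = 2)

simple_fit   <- lm(life ~ alcohol)            # ignores income
adjusted_fit <- lm(life ~ alcohol + income)   # controls for income

simple_slope   <- unname(coef(simple_fit)["alcohol"])
adjusted_slope <- unname(coef(adjusted_fit)["alcohol"])

cat("Alcohol slope without income:", round(simple_slope, 2), "\n")
cat("Alcohol slope with income:   ", round(adjusted_slope, 2), "\n")
```

In this simulation the alcohol slope shrinks toward zero once income enters the model, which mirrors how the Alcohol coefficient loses significance in the full model of the report.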

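7. Do densely populated countries tend to have lower life expectancy?

Initially, there is a significant outlier that skews the data dramatically. Once it was removed, the slope changed from -2.65e-09 to -1.04e-08, and it remained statistically insignificant in both fits (p = 0.68 and p = 0.65). Considering that it is possible for a country to have a significantly high population, it was decided to keep the data point in. Note that the Population variable measures total population rather than population density.

                      life expectancy = 70.58 - 2.65e-09(Population)

As the population of a country increases by one person, the fitted life expectancy decreases by 2.65e-09 years, or about 2.65 years per billion people, starting at 70.58 years if there is no population. The slope is not statistically significant (p = 0.68) and the model explains almost none of the variation (R-squared = 0.001), so these data give no evidence that more populous countries have lower life expectancy. The intercept has no logical interpretation on its own, since a country with no population has no life expectancy.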

## 
## Call:
## lm(formula = Life.expectancy ~ Population, data = df1_complete)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.477  -5.797   1.446   5.222  18.421 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.058e+01  7.680e-01  91.903   <2e-16 ***
## Population  -2.654e-09  6.488e-09  -0.409    0.683    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.633 on 129 degrees of freedom
## Multiple R-squared:  0.001295,   Adjusted R-squared:  -0.006447 
## F-statistic: 0.1673 on 1 and 129 DF,  p-value: 0.6832

## 
## Call:
## lm(formula = Life.expectancy ~ Population, data = newdata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.561  -5.546   1.635   5.137  18.332 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.067e+01  8.125e-01  86.978   <2e-16 ***
## Population  -1.042e-08  2.306e-08  -0.452    0.652    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.662 on 128 degrees of freedom
## Multiple R-squared:  0.001592,   Adjusted R-squared:  -0.006208 
## F-statistic: 0.204 on 1 and 128 DF,  p-value: 0.6522
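
The Population slopes in the summaries above are expressed in years per additional person, which makes them look vanishingly small. The sketch below converts the first reported slope to more readable units, then shows on simulated data (not the WHO observations) that rescaling the predictor before fitting gives the same conversion. The simulated variables and their ranges are assumptions.

```r
# Slope from the first lm() summary above, in years per additional person.
slope_per_person <- -2.654e-09

# The same slope expressed per million and per billion people.
slope_per_million <- slope_per_person * 1e6
slope_per_billion <- slope_per_person * 1e9
cat("Years per million people:", slope_per_million, "\n")
cat("Years per billion people:", slope_per_billion, "\n")

# Rescaling the predictor before fitting gives the same result directly.
# Simulated data only; these are not the WHO observations.
set.seed(1)
population <- runif(100, min = 1e5, max = 1e9)
life       <- 70 + rnorm(100, sd = 8)

raw_fit    <- lm(life ~ population)            # population in persons
scaled_fit <- lm(life ~ I(population / 1e6))   # population in millions

raw_slope    <- unname(coef(raw_fit)[2])
scaled_slope <- unname(coef(scaled_fit)[2])
cat("Scaled slope equals raw slope x 1e6:",
    isTRUE(all.equal(scaled_slope, raw_slope * 1e6, tolerance = 1e-6)), "\n")
```

Rescaling changes only the units of the coefficient; the t value, p-value, and R-squared stay the same.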

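8. What is the impact of immunization coverage on life expectancy?

Reviewing the graphs, the relationship between life expectancy and immunization appears weak. In the regression below, only Polio is significant at the 0.05 level (p = 0.019) and Diphtheria is marginal (p = 0.079). The model as a whole is significant (p < 0.001), but its four variables together explain only about 17% of the variation in life expectancy.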

## 
## Call:
## lm(formula = Life.expectancy ~ ., data = df_immunizations)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -22.1505  -5.2695   0.7869   5.0334  19.8007 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.647e+01  3.028e+00  18.648   <2e-16 ***
## Hepatitis.B -9.120e-02  6.498e-02  -1.404   0.1629    
## Measles     -2.967e-05  7.101e-05  -0.418   0.6767    
## Polio        1.172e-01  4.912e-02   2.385   0.0186 *  
## Diphtheria   1.405e-01  7.928e-02   1.772   0.0789 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.962 on 126 degrees of freedom
## Multiple R-squared:  0.1702, Adjusted R-squared:  0.1438 
## F-statistic:  6.46 on 4 and 126 DF,  p-value: 9.221e-05
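
As a check on how the overall F statistic relates to R-squared, the sketch below recomputes it from the values printed in the summary above. The variable names are assumptions; the formula is the standard one for the overall F test of a linear model with an intercept.

```r
# Values copied from the lm() summary above.
r_squared    <- 0.1702
n_predictors <- 4      # Hepatitis.B, Measles, Polio, Diphtheria
resid_df     <- 126    # residual degrees of freedom

# Overall F statistic: explained variation per predictor divided by
# unexplained variation per residual degree of freedom.
f_stat <- (r_squared / n_predictors) / ((1 - r_squared) / resid_df)

# Upper-tail probability under the F distribution.
p_value <- pf(f_stat, df1 = n_predictors, df2 = resid_df, lower.tail = FALSE)

cat("F statistic:", round(f_stat, 2), "\n")
cat("p-value:", signif(p_value, 3), "\n")
```

A small p-value here says only that the predictors jointly explain more than nothing; with an R-squared of 0.17, most of the variation in life expectancy remains unexplained by this model.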

Objective 1

Display the ability to build regression models using the skills and discussions from Unit 1 and 2 with the purpose of identifying key relationships, interpreting those relationships, and making good predictions.

Reminder: the key here is to tell a good story.

Build Model 1

  • Identify key relationships
  • Ensure interpretability
  1. Perform regression analysis

  2. Report predictive ability
    1. Test/train set
    2. CV data
  3. Hypothesis Testing

  4. Interpret the coefficients

  5. Confidence intervals

  6. Practical and statistical significance
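
The Model 1 steps above can be sketched roughly as follows. The data are simulated stand-ins with hypothetical columns `x1`, `x2`, and `y`, and the 70/30 split is an assumption; the real analysis would use the report's data frame and chosen predictors.

```r
set.seed(123)
n <- 200

# Simulated stand-in for the report's data frame; columns are hypothetical.
sim_data   <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim_data$y <- 60 + 3 * sim_data$x1 - 2 * sim_data$x2 + rnorm(n, sd = 2)

# Hold out 30% of the rows as a test set.
test_idx   <- sample(seq_len(n), size = round(0.3 * n))
train_data <- sim_data[-test_idx, ]
test_data  <- sim_data[test_idx, ]

# Fit on the training rows only.
model1 <- lm(y ~ x1 + x2, data = train_data)

# Test ASE: average squared error of predictions on the held-out rows.
test_pred <- predict(model1, newdata = test_data)
test_ase  <- mean((test_data$y - test_pred)^2)

# Hypothesis tests (t test per coefficient) and 95% confidence intervals.
coef_table <- summary(model1)$coefficients
conf_ints  <- confint(model1, level = 0.95)

print(round(coef_table, 3))
print(round(conf_ints, 3))
cat("Test ASE:", round(test_ase, 3), "\n")
```

The coefficient table covers the hypothesis tests and statistical significance; practical significance still has to be judged from the size of each estimate in its original units.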

Model 2

- Produce the best predictions possible
- Interpretation is no longer required, hence complexity is no longer an issue
  1. Feature selection to avoid overfitting

  2. Create the model

  3. Compare model 1 vs. model 2

  4. Comment on the differences of the models and whether model 2 brings any benefit
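
One way to compare model 1 with model 2 is k-fold cross-validated ASE, sketched below on simulated data. The helper `cv_ase`, the choice of five folds, and the cubic polynomial used as the more complex model are all assumptions for illustration.

```r
set.seed(2024)
n <- 300

# Simulated stand-in data with a curved relationship; not the WHO data.
sim_data   <- data.frame(x = runif(n, -2, 2))
sim_data$y <- 65 + 2 * sim_data$x + 1.5 * sim_data$x^2 + rnorm(n, sd = 1)

# Hypothetical helper: k-fold cross-validated ASE for a model formula.
# Assumes the response column is named y.
cv_ase <- function(model_formula, data, k = 5) {
  folds <- sample(rep(seq_len(k), length.out = nrow(data)))
  fold_errors <- sapply(seq_len(k), function(i) {
    fit  <- lm(model_formula, data = data[folds != i, ])
    pred <- predict(fit, newdata = data[folds == i, ])
    mean((data$y[folds == i] - pred)^2)
  })
  mean(fold_errors)
}

ase_model1 <- cv_ase(y ~ x, sim_data)            # simple, interpretable
ase_model2 <- cv_ase(y ~ poly(x, 3), sim_data)   # more flexible

cat("CV ASE, model 1:", round(ase_model1, 3), "\n")
cat("CV ASE, model 2:", round(ase_model2, 3), "\n")
```

If the more complex model does not lower the cross-validated ASE by a meaningful amount, the simpler and more interpretable model 1 is usually the better choice.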

Objective 2

- Nonparametric technique
- kNN or regression trees (select one)

Set of predictors from previous regression: (fill this out)

  1. Model

  2. A brief description of your nonparametric model’s strategy to make a prediction. Include Pros and Cons.

  3. Provide any additional details that you feel might be necessary to report.

  4. Report the test ASE using this nonparametric model so we can see how well it does compared to regression.
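
If kNN is the chosen nonparametric technique, its prediction strategy and test ASE could be sketched as below. This is a hand-rolled base R version on simulated data, written to avoid depending on any package; the function `knn_predict`, the value k = 5, and the simulated predictors are assumptions. Both simulated predictors share the same scale; real predictors would normally be standardized first, since kNN relies on distances.

```r
set.seed(7)
n <- 300

# Simulated stand-in data; the predictors are hypothetical.
sim_data   <- data.frame(x1 = runif(n), x2 = runif(n))
sim_data$y <- 60 + 10 * sin(2 * pi * sim_data$x1) + 5 * sim_data$x2 + rnorm(n)

test_idx <- sample(seq_len(n), size = 90)
train    <- sim_data[-test_idx, ]
test     <- sim_data[test_idx, ]

# Hand-rolled kNN regression: predict each test row as the mean response
# of its k nearest training rows (Euclidean distance on the predictors).
knn_predict <- function(train_x, train_y, test_x, k = 5) {
  train_x <- as.matrix(train_x)
  test_x  <- as.matrix(test_x)
  apply(test_x, 1, function(row) {
    dists <- sqrt(colSums((t(train_x) - row)^2))
    mean(train_y[order(dists)[seq_len(k)]])
  })
}

knn_pred <- knn_predict(train[, c("x1", "x2")], train$y,
                        test[, c("x1", "x2")], k = 5)
knn_ase  <- mean((test$y - knn_pred)^2)

# Linear regression on the same split, for comparison.
lm_fit <- lm(y ~ x1 + x2, data = train)
lm_ase <- mean((test$y - predict(lm_fit, newdata = test))^2)

cat("Test ASE, kNN (k = 5):  ", round(knn_ase, 3), "\n")
cat("Test ASE, linear model: ", round(lm_ase, 3), "\n")
```

kNN makes no assumption about the shape of the relationship, which helps when the relationship is nonlinear as in this simulation, but it yields no coefficients to interpret and is sensitive to the choice of k and to predictor scaling.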